4 - Compositional Analysis

Author

CDN team

Published

March 8, 2024

Introduction

In this notebook we will study which cell types compose different samples in single-cell data. We will do this using a subset of statistics called compositional data analysis (or CoDA for short), because single-cell experiments are samples of a larger tissue, and the cells we obtain represents the proportions of cell types in the tissue, but does not reflect their true abundance.

CoDA relativizes proportions by comparing them to a stable population, comparing them to each other, or to a mean across samples, all resulting in a table of relative abundances of cell types per sample. This sample x species matrix is a natural foundation for conducting case-control analysis, as we have a single vector of variables per sample. Here, we can scrutinize variances between conditions and between samples.

Useful Resources

  • John Aitchison’s Compositional Data Analysis (J. Aitchison 1982)
  • scCODA, a python + scanpy package for ALR compositional analysis (Büttner et al. 2021)
  • Cacoa, an R package for case-control analysis that uses ILR
  • compositions, an R package for CoDA
  • A paper describing the compositionality problem in terms of microbiome studies (Morton et al. 2019)

Key Takeaways

  • Compositions provide a good foundation for comparing individual samples
  • The compositional nature of scRNAseq means only conclusions about relative values can be made, and overall cell density can bias results
  • Compositional transforms allow quantitative assessment of scRNA abundances

Libraries

Installation

if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages('tidyverse')
if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages('Seurat')
if (!requireNamespace("compositions", quietly = TRUE))
    install.packages('compositions')
if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages('colorBlindness')
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
if (!requireNamespace("EnhancedVolcano", quietly = TRUE))
    BiocManager::install("EnhancedVolcano")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
    BiocManager::install("ComplexHeatmap")
if (!requireNamespace("scales", quietly = TRUE))
    install.packages('scales')
if (!requireNamespace("viridis", quietly = TRUE))
    install.packages('viridis')
if (!requireNamespace("DT", quietly = TRUE))
    install.packages('DT')
if (!requireNamespace("reshape2", quietly = TRUE))
    install.packages('reshape2')

Load Libraries

library(colorBlindness)
library(tidyverse)
library(EnhancedVolcano)
library(viridis)
library(scales)
library(DT)
library(Seurat)
library(compositions)
library(reshape2)
library(ComplexHeatmap)

Load data

srobj <- readRDS('/Users/kylekimler/Downloads/d8e35450-de43-451a-9979-276eac688bce.rds')
genes <- read_csv('/Users/kylekimler/CDN/workshops/workshop1/data/cov_flu_gene_names_table.csv') # Load a provided gene conversion table to convert ENSG to readable gene symbols

# Need to remake seurat object
mtx <- srobj@assays$RNA@data
rownames(mtx) <- genes[match(row.names(mtx),genes$index),]$feature_name

srobj <- CreateSeuratObject(counts = mtx, meta.data = srobj@meta.data)

srobj
An object of class Seurat 
33234 features across 59572 samples within 1 assay 
Active assay: RNA (33234 features, 0 variable features)
 1 layer present: counts
set.seed(687)

Other setup

Generate a color palette for plotting

# https://www.datanovia.com/en/blog/easy-way-to-expand-color-palettes-in-r/
# nb.cols <- length(unique(se$Celltype))
# mycolors <- colorRampPalette(paletteMartin)(nb.cols)
pal <- paletteMartin
names(pal) <- sort(unique(srobj$Celltype))

Composition of clusters per sample

In papers we often see percentages of clusters per sample or of subclusters in a larger celltype. These are done using stacked bar plots as follows.

celltypePercentagesDF <- srobj@meta.data %>% count(`Sample ID`,Celltype) %>% group_by(`Sample ID`) %>% reframe(Celltype, celltype_n = n, total_n_cells = sum(n)) %>% mutate(pct_celltype = celltype_n / total_n_cells)

datatable(celltypePercentagesDF)
celltypePercentagesDF %>% ggplot(aes(x=`Sample ID`, y=pct_celltype, fill = Celltype), color='white') + 
    geom_bar(position='stack', stat='identity') + 
    scale_fill_manual(values = pal) + 
    ggtitle('Composition of each sample by celltype') +
    theme_linedraw(base_size = 20) +
    theme(axis.text.x = element_text(angle=45,hjust=1,vjust=1))

Comparing compositions with Wilcoxon tests

And often these compositions are compared using rank sum tests, which Morton et al (Morton et al. 2019) show is a great way to avoid the compositionality problem. The most common rank sum test is the Wilcoxon test, the default test Seurat uses for comparisons of genes across clusters with FindAllMarkers.

If we use the Wilcoxon test here in a comparison of compositions across samples Flu vs. Normal, we will see that the result depends on proper normalization of the data.

# Create a function to perform Wilcoxon rank-sum tests and return p-values
perform_wilcox_test <- function(data, group_col, group1, group2, species_col) {
  group1_data <- data[data[[group_col]] == group1, species_col]
  group2_data <- data[data[[group_col]] == group2, species_col]
  
  test_result <- wilcox.test(group1_data, group2_data)
  return(data.frame(p.val = test_result$p.value, statistic = test_result$statistic, species = species_col, test = paste(group1,'vs',group2)))
}

rawTb <- celltypePercentagesDF %>% dplyr::select(sample=`Sample ID`, Celltype, celltype_n) %>% pivot_wider(names_from=Celltype, 
values_from= celltype_n, values_fill=0) %>% separate(sample, sep = " ", into=c('group','sample_n'), remove=FALSE) %>% column_to_rownames("sample") %>% select(-sample_n)

# Create a long dataframe of raw values for comparisons
raw_long <- rawTb %>% 
  pivot_longer(-group, names_to="species", values_to="counts") %>%
  mutate(transform = "Raw Counts")

species_col_names <- colnames(rawTb)[2:ncol(rawTb)]

wilcoxon_raw <- lapply(species_col_names, function(species) perform_wilcox_test(rawTb, group_col = "group", group1 = 'Flu', group2 = 'Normal', species_col = species))

wilcoxon_raw <- wilcoxon_raw %>% bind_rows %>% data.frame %>% mutate(transform = "Raw")

pctTb <- celltypePercentagesDF %>% dplyr::select(sample=`Sample ID`, Celltype, pct_celltype) %>% pivot_wider(names_from=Celltype, 
values_from=pct_celltype, values_fill=0) %>% separate(sample, sep = " ", into=c('group','sample_n'), remove=FALSE) %>% column_to_rownames("sample") %>% select(-sample_n) # reorganize dataframe for wilcoxon tests

# Create a long dataframe of percent values for comparisons
pct_long <- pctTb %>% 
  pivot_longer(-group, names_to="species", values_to="counts") %>%
  mutate(transform = "Percentages")

wilcoxon_pct <- lapply(species_col_names, function(species) perform_wilcox_test(pctTb, group_col = "group", group1 = 'Flu', group2 = 'Normal', species_col = species)) 

wilcoxon_pct <- wilcoxon_pct %>% bind_rows %>% data.frame %>% mutate(transform = "Percent")

combined_data <- bind_rows(raw_long, pct_long)

ggplot(combined_data %>% filter(group %in% c('Flu','Normal')), aes(x = species, y=counts, fill = group)) +
  geom_boxplot() +
  facet_wrap(~ transform, scales = "free_y") +
  labs(title = "Comparison of Species Counts per Group",
       x = "Species",
       y = "Counts / Percentages") +
  theme_minimal(base_size = 20) +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        legend.title = element_blank()) +
  scale_fill_manual(values = unname(pal[c(1,4)]))

wilcoxon_comparisons <- bind_rows(wilcoxon_raw,wilcoxon_pct)

ggplot(wilcoxon_comparisons, aes(x = reorder(species, p.val), y = p.val, color = p.val < 0.05)) +
  geom_point() + # Use geom_bar() for bar plots
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "blue"), name = "p-val < 0.05") +
  facet_wrap(~ transform, scales = "free_x") +
  labs(x = "Species", title = "Comparison of P-Values Across Species and Transformations") +
  theme_minimal(base_size = 20) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "top") +
  guides(fill = guide_legend(title = "P-value < 0.05"))

Compositional transforms

The compositions package provides compositional transforms that allow quantitative comparisons of compositional data. In the most popular package for compositional analysis in scRNA, scCODA, the additive-log-ratio is used, where compositions are transformed to log-ratios of the cluster with the least dispersion that is present across 95% of all samples.

\text{ALR}(x_i) = \log\left(\frac{x_i}{x_D}\right)

where:

  • () denotes the natural logarithm,
  • (x_i) is the value of component (i) in the composition,
  • (x_D) is the chosen reference component from the compositional dataset.

\text{CLR}(x_i) = \log\left(\frac{x_i}{g(x)}\right)

where:

  • \log denotes the natural logarithm,
  • x_i is the value of component i in the composition,
  • g(x) is the geometric mean of all components in the composition, calculated as g(x) = \left(\prod_{i=1}^{D} x_i\right)^{\frac{1}{D}}, with D being the total number of components in the composition.

The CLR is very easy to use because it does not require a reference. We can calculate the ALR in a similar way to what scCODA does by choosing the reference cluster based on minimal dispersion and at least 95% presence across samples

compTb <- pctTb

clrTb <- compositions::clr(compTb[,-1]) %>% data.frame
colnames(clrTb) <- species_col_names
clrTb$group <- compTb$group 

# Create a long dataframe of raw values for comparisons
clr_long <- clrTb %>% 
  pivot_longer(-group, names_to="species", values_to="counts") %>%
  mutate(transform = "CLR")

wilcoxon_clr <- lapply(species_col_names, function(species) perform_wilcox_test(clrTb, group_col = "group", group1 = 'Flu', group2 = 'Normal', species_col = species))

wilcoxon_clr <- wilcoxon_clr %>% bind_rows %>% data.frame %>% mutate(transform = "CLR")

presence_threshold <- 0.95
presence <- apply(compTb > 0, 2, function(x) mean(x)) > presence_threshold
# make sure group column isn't included
presence[1] <- FALSE

iqr_values <- apply(compTb[, presence], 2, IQR)
denominator_index <- which.min(iqr_values)
denominator_name <- names(iqr_values[denominator_index])

alrTb <- compositions::alr(rawTb[,-1] + 1, #apply a tiny pseudocount to avoid logarithmizing a 0
ivar = denominator_name) %>% data.frame
alr_col_names <- species_col_names[species_col_names!=denominator_name]
colnames(alrTb) <- alr_col_names
alrTb$group <- compTb$group 

alr_long <- alrTb %>% 
  pivot_longer(-group, names_to="species", values_to="counts") %>%
  mutate(transform = "ALR")

combined_data <- bind_rows(combined_data, clr_long, alr_long)

ggplot(combined_data %>% filter(group %in% c('Flu','Normal')), aes(x = species, y=counts, fill = group)) +
  geom_boxplot() +
  labs(title = "Comparison of Species Counts per Group",
       x = "Species",
       y = "values") +
  theme_minimal(base_size = 20) +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        legend.title = element_blank()) +
  scale_fill_manual(values = unname(pal[c(1,4)])) +
  facet_wrap(~ transform, scales = "free_y")

wilcoxon_alr <- lapply(alr_col_names, function(species) perform_wilcox_test(pctTb, group_col = "group", group1 = 'Flu', group2 = 'Normal', species_col = species))

wilcoxon_alr <- wilcoxon_alr %>% bind_rows %>% data.frame %>% mutate(transform = "ALR")

wilcoxon_comparisons <- bind_rows(wilcoxon_comparisons,wilcoxon_clr,wilcoxon_alr)

ggplot(wilcoxon_comparisons, aes(x = reorder(species, p.val), y = p.val, color = p.val < 0.05)) +
  geom_point() + # Use geom_bar() for bar plots
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "blue"), name = "p-val < 0.05") +
  facet_wrap(~ transform, scales = "free_x") +
  labs(x = "Species", title = "Comparison of P-Values Across Species and Transformations") +
  theme_minimal(base_size = 20) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "top") +
  guides(fill = guide_legend(title = "P-value < 0.05"))

So we can see already that Wilcoxon results of the transforms are different than percentage and raw

Comparing samples with compositions

PCA

alrPCA <- prcomp(alrTb[,-15])
clrPCA <- prcomp(clrTb[,-16])
rawPCA <- prcomp(rawTb[,-1])
pctPCA <- prcomp(pctTb[,-1])

pca_alr <- data.frame(alrPCA$x, group = alrTb$group, transform = alr_long$transform %>% unique)
pca_clr <- data.frame(clrPCA$x, group = clrTb$group, transform = clr_long$transform %>% unique)
pca_pct <- data.frame(pctPCA$x, group = pctTb$group, transform = pct_long$transform %>% unique)
pca_raw <- data.frame(rawPCA$x, group = rawTb$group, transform = raw_long$transform %>% unique)

pcaDF <- bind_rows(pca_alr, pca_clr, pca_pct, pca_raw)

pcaDF <- pcaDF %>% mutate(sample = row.names(pcaDF) %>% str_replace_all('\\.\\.\\.(.*)',''))

ggplot(pcaDF, aes(x = PC1, y = PC2, color = group, label = sample)) +
  geom_point(size=4) + # Use geom_bar() for bar plots 
  geom_text_repel() +
  facet_wrap(~ transform, scales = "free") +
  labs(title = "PCA Across Transformations") +
  theme_minimal(base_size = 20) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  guides(fill = guide_legend(title = "P-value < 0.05")) +
  scale_color_manual(values = unname(pal[c(4,6,9)]))

loadings_alr <- data.frame(alrPCA$rotation, transform = alr_long$transform %>% unique)
loadings_clr <- data.frame(clrPCA$rotation, transform = clr_long$transform %>% unique)
loadings_pct <- data.frame(pctPCA$rotation, transform = pct_long$transform %>% unique)
loadings_raw <- data.frame(rawPCA$rotation, transform = raw_long$transform %>% unique)

loadingsDF <- bind_rows(loadings_alr, loadings_clr, loadings_pct, loadings_raw) %>% rownames_to_column('species') %>% group_by(transform) %>% arrange(PC1)


loadingsDF %>% ggplot(aes(x=PC1, y=factor(species, levels=loadingsDF$species), fill = PC1 < 0)) + geom_bar(stat='identity') + facet_wrap(.~transform, scales = 'free') + theme_linedraw(base_size=20) + scale_fill_manual(values = unname(pal[c(6,11)]))# + scale_y_discrete(labels = loadingsDF$species %>% str_replace_all('\\.\\.\\.(.*)',''))

Distances between samples

We can also directly calculate distances between samples based on sample composition space. This can give us a better idea about outliers and how samples group together. We can also use any of the clustering metrics to test coherence of groups.

In compositional analysis, the method for calculating distances between samples proposed by John Aitchison (John Aitchison 2005) remains the a popular method. Termed Aitchison Distance, it is the Euclidean distance between samples based on CLR transformed counts of species.

clr_distances <- dist(clrTb[,-16], method = "euclidean")

distdf <- reshape2::melt(as.matrix(clr_distances), varnames = c("from", "to"))

datatable(distdf)
distlong <- distdf %>% separate(from,c('group1','sample1'),sep=' ', remove=FALSE) %>% 
            separate(to,c('group2','sample2'),sep=' ', remove=FALSE)
            
distlong %>% filter(sample1 !=sample2) %>% mutate(distance_group = ifelse(group1==group2,'between same disease individuals',ifelse(group1 == 'Normal' | group2 == 'Normal', 'between disease and healthy', 'between Covid and Flu'))) %>% 
    ggplot(aes(x=value,fill=distance_group)) + geom_density(alpha=0.4) + 
    ggtitle('COVID-Flu study distances\nAitchison Distances') +
    theme_bw(base_size = 20) + scale_fill_manual(values = unname(pal[c(3,6,8)])) + xlab('euclidean distance')

ComplexHeatmap::Heatmap(as.matrix(clr_distances))

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ComplexHeatmap_2.18.0  reshape2_1.4.4         compositions_2.0-8     Seurat_5.0.2           SeuratObject_5.0.1     sp_2.1-3               DT_0.32                scales_1.3.0           viridis_0.6.5          viridisLite_0.4.2      EnhancedVolcano_1.20.0 ggrepel_0.9.5          lubridate_1.9.3        forcats_1.0.0          stringr_1.5.1          dplyr_1.1.4            purrr_1.0.2            readr_2.1.5            tidyr_1.3.1            tibble_3.2.1           ggplot2_3.5.0          tidyverse_2.0.0        colorBlindness_0.1.9   BiocManager_1.30.22    knitr_1.45            

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22       splines_4.3.1          later_1.3.2            polyclip_1.10-6        fastDummies_1.7.3      lifecycle_1.0.4        doParallel_1.0.17      globals_0.16.3         lattice_0.22-5         vroom_1.6.5            MASS_7.3-60.0.1        crosstalk_1.2.1        magrittr_2.0.3         sass_0.4.8             plotly_4.10.4          rmarkdown_2.26         jquerylib_0.1.4        yaml_2.3.8             httpuv_1.6.14          sctransform_0.4.1      spam_2.10-0            spatstat.sparse_3.0-3  reticulate_1.35.0      cowplot_1.1.3          pbapply_1.7-2          bayesm_3.1-6           RColorBrewer_1.1-3     abind_1.4-5            Rtsne_0.17             BiocGenerics_0.48.1    tensorA_0.36.2.1       circlize_0.4.16        IRanges_2.36.0         S4Vectors_0.40.2       irlba_2.3.5.1          listenv_0.9.1          spatstat.utils_3.0-4   goftest_1.2-3          RSpectra_0.16-1        spatstat.random_3.2-3  fitdistrplus_1.1-11    parallelly_1.37.1      leiden_0.4.3.1         codetools_0.2-19       tidyselect_1.2.0       shape_1.4.6.1          farver_2.1.1           matrixStats_1.2.0      stats4_4.3.1           spatstat.explore_3.2-6 jsonlite_1.8.8        
 [52] GetoptLong_1.0.5       ellipsis_0.3.2         progressr_0.14.0       ggridges_0.5.6         survival_3.5-8         iterators_1.0.14       foreach_1.5.2          tools_4.3.1            ica_1.0-3              Rcpp_1.0.12            glue_1.7.0             gridExtra_2.3          xfun_0.42              withr_3.0.0            fastmap_1.1.1          fansi_1.0.6            digest_0.6.34          timechange_0.3.0       R6_2.5.1               mime_0.12              gridGraphics_0.5-1     colorspace_2.1-0       scattermore_1.2        tensor_1.5             spatstat.data_3.0-4    utf8_1.2.4             generics_0.1.3         data.table_1.15.2      robustbase_0.99-2      httr_1.4.7             htmlwidgets_1.6.4      uwot_0.1.16            pkgconfig_2.0.3        gtable_0.3.4           lmtest_0.9-40          htmltools_0.5.7        dotCall64_1.1-1        clue_0.3-65            png_0.1-8              tzdb_0.4.0             rjson_0.2.21           nlme_3.1-164           cachem_1.0.8           zoo_1.8-12             GlobalOptions_0.1.2    KernSmooth_2.23-22     parallel_4.3.1         miniUI_0.1.1.1         pillar_1.9.0           vctrs_0.6.5            RANN_2.6.1            
[103] promises_1.2.1         xtable_1.8-4           cluster_2.1.6          evaluate_0.23          magick_2.8.3           cli_3.6.2              compiler_4.3.1         rlang_1.1.3            crayon_1.5.2           future.apply_1.11.1    labeling_0.4.3         plyr_1.8.9             stringi_1.8.3          deldir_2.0-4           munsell_0.5.0          lazyeval_0.2.2         spatstat.geom_3.2-9    Matrix_1.6-5           RcppHNSW_0.6.0         hms_1.1.3              patchwork_1.2.0        bit64_4.0.5            future_1.33.1          shiny_1.8.0            ROCR_1.0-11            igraph_2.0.2           bslib_0.6.1            DEoptimR_1.1-3         bit_4.0.5             

References

Aitchison, J. 1982. “The Statistical Analysis of Compositional Data.” Journal of the Royal Statistical Society. Series B (Methodological) 44 (2): 139–77. https://www.jstor.org/stable/2345821.
Aitchison, John. 2005. “A Concise Guide to Compositional Data Analysis.” Research {Reports} or {Papers}. http://web.archive.org/web/20210423114827/http://ima.udg.edu/Activitats/CoDaWork05/.
Büttner, M., J. Ostner, C. L. Müller, F. J. Theis, and B. Schubert. 2021. scCODA Is a Bayesian Model for Compositional Single-Cell Data Analysis.” Nature Communications 12 (1): 6876. https://doi.org/10.1038/s41467-021-27150-6.
Morton, James T., Clarisse Marotz, Alex Washburne, Justin Silverman, Livia S. Zaramela, Anna Edlund, Karsten Zengler, and Rob Knight. 2019. “Establishing Microbial Composition Measurement Standards with Reference Frames.” Nature Communications 10 (1): 2719. https://doi.org/10.1038/s41467-019-10656-5.